Add `rna` command: RNA-seq QC metrics (CollectRnaSeqMetrics + transcript insert size) by tfenne · Pull Request #40 · fulcrumgenomics/riker

tfenne · 2026-06-27T11:13:20Z

Summary

Adds the rna command: a single-pass, single-gene-model-load RNA-seq QC tool. It started as a port of Picard CollectRnaSeqMetrics (base composition, strand specificity, transcript 5'/3' coverage bias and CV) + fgbio EstimateRnaSeqInsertSize (transcript-space insert size, introns collapsed), and has grown into a superset that also covers the RSeQC / RNA-SeQC / Qualimap families: read accounting and genomic origin, gene detection, splice-junction annotation, transcript integrity (TIN), and per-biotype read counts. The collector is wired into multi.

riker rna -i sample.bam -o out_prefix --gene-model gencode.gtf.gz

Outputs (house conventions — lowercase headers, frac_, no metadata lines):
out.rna-metrics.txt (one wide row), out.rna-biotype.txt(+.pdf), out.rna-gene-expression.pdf, out.rna-coverage.pdf, out.rna-insert-size.txt / -histogram.txt(+.pdf).

Metrics (one `rna-metrics.txt` row)

Read accounting & genomic origin — total / mapped / unmapped / ignored / duplicate-rate; exonic / intronic / intergenic / ambiguous / ribosomal, with assigned_reads and fractions.
Gene detection — genes_detected (genes with ≥ --genes-detected-min-reads).
Splice junctions — known / partial / novel observations and distinct junctions vs. the annotation, plus frac_known_juncs.
Base composition — coding / UTR / intronic / intergenic / ribosomal bases and fractions, frac_mrna_bases, frac_usable_bases.
Strandedness — detected strand, correct-strand fraction, R1/R2 transcription-strand counts.
Coverage uniformity — median CV, 5'/3'/5'→3' bias, and TIN (median / mean / stddev / transcript count).

Highlights

Multiple gene-model formats, auto-detected from content: UCSC refFlat, GTF, GFF3 (GENCODE / Ensembl / RefSeq), with contig-name reconciliation (chr add/strip, MT↔chrM, RefSeq NC_* → common name). Gzip handled transparently.
Strandedness auto-detection (--strand auto, default) from observed R1/R2 transcription-strand counts; unstranded/forward/reverse force a model.
Ribosomal regions = union of biotype-derived rRNA/Mt_rRNA genes and an optional --ribosomal-intervals file (BED or IntervalList).
Transcript-space insert size per pair orientation (FR/RF/tandem). Uses the MC (mate CIGAR) tag when present (exact, any orientation); when MC is absent it estimates FR pairs from a spec-compliant TLEN at the forward read rather than dropping them — see below.
Plots always produced, Fulcrum-branded: per-biotype bar chart, per-gene RPK expression distribution (all genes vs. protein-coding), normalized transcript coverage, and the insert-size histogram (reusing the shared isize helper so the two match).

TIN scale & validation

riker's TIN is the largest deliberate numeric difference from the reference tools: it scores one well-covered transcript per gene, so its median runs ~14–16 points higher than RSeQC/RustQC, which score every isoform. The two report the same shape on a different scale. This is validated in docs/rna-integrity.md on a controlled RNA-degradation ladder (Sigurgeirsson et al. 2014, RIN 10→2) plus a depth-downsampling sweep: both implementations track degradation in lockstep, and riker is markedly more depth-robust (≈0.5% vs ≈4% TIN change across a 5× depth reduction). Indexing and alignment were both done with rustar-aligner against GENCODE v50, for a single-tool story.

Intentional differences vs. Picard/fgbio (documented in ERRATA.md)

Deliberate corrections/choices, not regressions; on identical inputs read/base accounting matches Picard to the base and composition agrees within ~0.5%:

Coverage/bias fixes: top-1000 transcripts (Picard keeps 1001 via an off-by-one); short transcripts (< 2×end_bias_bases) excluded from the CV/bias set; every aligned base counted (Picard drops the last base of each block); non-overlapping normalized-position bins.
Ribosomal fragment overlap: union of merged overlaps (Picard takes the single largest, Math.max); fragment end from the MC tag, falling back to TLEN, never read length.
Insert-size mate span — MC, else a TLEN best guess. With MC the mate's exact blocks are used (any orientation). Without it, FR pairs are kept by deriving the mate's right end from start + |TLEN| - 1 at the forward read: the 5'-to-5' transcript-space size is then exact, and the unknown mate splicing is gated by requiring both mate ends in an exon plus a ≥95% collapsed-exonic footprint. Validated against the MC path on a real BAM (9.7M FR pairs; mean/median agree within ~0.5%). A BAM with neither MC nor a valid TLEN yields no/incorrect insert sizes — samtools fixmate -m remains the way to guarantee exactness.
Insert-size acceptance: a pair is used when exactly one gene encloses the whole pair span (plus per-transcript overlap and agreement checks), rather than fgbio's "exactly one gene overlaps the span." A non-enclosing bystander gene near the pair no longer disqualifies it (~7% more pairs, <1–2 bp shift).
Gene → locus → transcript model: a gene's transcripts are grouped into loci (same contig/strand, overlapping spans), so a gene at disjoint locations becomes independent loci.
Input must be coordinate-sorted (@HD SO:coordinate); validated at startup, enabling the allocation-free locus sweep.

Dropped Picard's per-position *.rna-coverage.txt table — Picard emits it only because its plotting engine is external; riker plots internally, so only the .pdf is kept.

Performance & memory

Profiled and tuned after the correctness-complete implementation (composition + coverage verified byte-identical at every step on a 75.2M-read PE BAM, GENCODE v29):

Throughput ~250k → ~2M reads/s (≈7×) — linear-time per-base work, classify-once-per-locus, block×interval-arithmetic classification (no per-base array), and a coordinate sweep replacing the interval tree.
Peak RSS 2.16 GB → ~0.8 GB (≈2.75×) — stream the gene model line-by-line (fgoxide) instead of slurping ~1.2 GB into a String, and store coverage once per locus over its exon union (reconstructing per-transcript coverage at finish).

The metrics-expansion (read origin, junctions, biotype, TIN) added ~1.5% runtime. riker stays ~10× faster single-threaded than RustQC and far lighter. BGZF decompression (~11%) is the remaining single-threaded floor.

Testing & review

Unit + integration tests for the command and the three parsers (all data built programmatically; no checked-in fixtures), plus the standard suite — 639 tests pass, fmt / clippy --all-targets (pedantic) / nextest all clean.
Validated against Picard CollectRnaSeqMetrics (read accounting exact; composition ~0.5%; coverage/bias ~1.3%, in the corrected direction), cross-checked vs. RustQC (junctions, TIN, biotype), and the TIN dose-response validated end-to-end on real degradation-ladder data.

Out of scope (tracked separately)

A curated GRCh38 rDNA BED / rRNA interval list with methodology, plus per-biotype/expression refinements — to be their own changes.

coderabbitai · 2026-06-27T11:26:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a new gene-model module for refFlat, GTF, and GFF3 loading and block classification, wires a new rna CLI subcommand and multi support, and documents the command in the README, changelog, and ERRATA. It also adds startup output-prefix writability checks across commands, updates insert-size histogram handling to use a shared PDF writer, and expands tests and metric docs.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding the new rna command for RNA-seq QC metrics and transcript insert size.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description matches the changeset, covering the new rna command, docs updates, output files, and related behavior changes.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tf_rna

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

tests/test_rna.rs (1)
247-247: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the shared float assertion macro in these tests.

These ad-hoc abs() < eps checks are inconsistent with the test suite’s standard helper and give worse failure output. Replace them with assert_float_eq!(a, b, eps). As per coding guidelines, tests/**/*.rs: "Use assert_float_eq!(a, b, eps) macro for float comparisons in tests".

Also applies to: 335-339, 411-424, 438-439, 455-455
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_rna.rs` at line 247, Replace the ad-hoc float comparison in the
RNA tests with the shared float assertion macro. In the affected test cases
around the frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/commands/common.rs`:
- Around line 22-25: Update the directory resolution logic in `common::` so root
prefixes like `/` or Windows drive roots are treated as the filesystem root
instead of falling back to `.`. The current `prefix.parent()` handling in the
`dir` selection should preserve root paths when `Path::parent()` returns `None`
or an empty parent. Add a regression test around the startup check path to cover
a root prefix case and verify it does not incorrectly pass as the current
directory.

In `@src/commands/rna.rs`:
- Around line 671-678: The top-1000 selection in the RNA command can include
more than 1000 transcripts when multiple loci share the cutoff mean, so tighten
the filtering in the selection logic around best_per_locus/means to
deterministically keep exactly 1000 entries. Update the threshold-based check
and the later loop that builds the retained transcript set so ties at the
boundary are broken in a stable way or truncated after sorting, ensuring n_tx
cannot exceed 1000 and the downstream bias/CV medians reflect the intended
top-1000 behavior.
- Around line 1230-1240: Update mate_alignment_end to follow the command’s
fragment-end contract: when mate_cigar/mate_alignment_blocks cannot derive the
mate end from MC, fall back to TLEN before using the read length. Locate the
logic in mate_alignment_end and replace the current sequence-length-based
fallback with TLEN-derived end coordinates, keeping the existing MC path intact
so pair-span bounds used by ribosomal classification and strand-template
inference remain correct.

In `@src/gene_model/gff.rs`:
- Around line 109-114: The tx_order loop in gff.rs is still indexing features
with features[&id], which will panic when an exon/CDS references an orphan
Parent ID that never created a feature record. Update this lookup in the parsing
path to use a safe get-based check in the same block that already handles
exons_by_parent, and either return a parse error or skip the orphan explicitly.
Keep the rest of the transcript/gene resolution logic intact by guarding the
feature access before using feature.parents.first() and the subsequent gene
lookup.

In `@src/gene_model/mod.rs`:
- Around line 447-449: `detect_format` is currently swallowing read/sniffing
failures by using `line.ok()?`, which turns I/O or decompression problems into
an unknown format result. Update `detect_format` in `gene_model::mod` to return
`Result<Option<GeneModelFormat>>` instead of `Option`, and propagate the
`BufRead::lines()` read error upward rather than converting it to `None`. Make
sure callers that rely on `detect_format` handle the error case explicitly,
while preserving the existing format-detection logic for successfully read
lines.
- Around line 514-518: The transcript clustering logic in the
`clusters.iter_mut().find(...)` block is comparing `other.contig` and
`tx.contig` as raw strings, which can split the same locus when contig aliases
differ. Update the comparison to use the resolved BAM contig identity instead of
the original names, reusing the existing contig-resolution logic used elsewhere
in `src/gene_model/mod.rs` before calling `spans_overlap`. Make sure transcripts
like `1`/`chr1` or `MT`/`chrM` are grouped into the same cluster when they map
to the same underlying contig.

In `@src/plotting.rs`:
- Around line 129-130: The early return in plotting logic leaves an old PDF
behind when no series qualify, so update the empty-case handling in the plotting
routine in `src/plotting.rs` to clear any previously generated output at `path`
before returning. Make the fix in the same branch that checks `plots.is_empty()`
so stale histogram files are removed whenever the current run produces no plot.

---

Nitpick comments:
In `@tests/test_rna.rs`:
- Line 247: Replace the ad-hoc float comparison in the RNA tests with the shared
float assertion macro. In the affected test cases around the
frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 185780c6-cb77-421d-986b-79d5f06a8684

📥 Commits

Reviewing files that changed from the base of the PR and between 3763b9a and 72afd23.

📒 Files selected for processing (25)

CHANGELOG.md
README.md
src/commands/alignment.rs
src/commands/basic.rs
src/commands/common.rs
src/commands/docs.rs
src/commands/error.rs
src/commands/gcbias.rs
src/commands/hybcap.rs
src/commands/isize.rs
src/commands/mod.rs
src/commands/multi.rs
src/commands/rna.rs
src/commands/wgs.rs
src/gene_model/gff.rs
src/gene_model/gtf.rs
src/gene_model/mod.rs
src/gene_model/refflat.rs
src/intervals.rs
src/lib.rs
src/main.rs
src/plotting.rs
tests/helpers/mod.rs
tests/test_multi.rs
tests/test_rna.rs

…t prefix) Three fixes from the PR #40 review: - mate_alignment_end: fall back to the TLEN-derived fragment end (min(read_start, mate_start) + |TLEN| - 1, exactly Picard's CoordMath.getEnd(fragmentStart, abs(insertSize))) when the MC tag is absent, instead of the read's own length. This matches the documented contract (the module doc and README both say 'falling back to TLEN, never read length') and the values fed to the ribosomal and strand-template fragment-span checks. Only affects pairs lacking MC; output on MC-tagged data is byte-identical. Extracted tlen_fragment_end as a pure, unit-tested helper. - detect_format: return Result<Option<GeneModelFormat>> and propagate line read / decompression errors instead of collapsing them to 'could not detect format' via line.ok()?. The gzip-without-extension hint now runs up front so that case still gets a targeted message. - validate_output_prefix: treat a root prefix ('/', 'C:\') as the directory to check rather than falling back to '.'. Declined (with rationale): tie-handling >1000 transcripts (f64 means don't tie in practice), gff features[&id] 'panic' (false positive — every tx_order id is inserted into features), cluster-on-resolved-contigs (real models use one contig convention per file), and remove-stale-PDF (deleting a user's file to avoid a rare stale-output case is the wrong trade-off). 619 tests pass; output byte-identical on the MC dataset.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

ERRATA.md (2)
228-230: 📐 Maintainability & Code Quality | 🔵 Trivial

Add language specifier to fenced code block.
- ```
+ ```text
  TIN = 100 · e^H / n,   where  H = −Σ pᵢ·ln pᵢ,   pᵢ = cᵢ / Σⱼ cⱼ
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @ERRATA.md around lines 228 - 230, Add a language specifier to the fenced
code block in the TIN formula snippet so the markdown renders as a text block;
update the existing fence around the equation to use a text code fence while
keeping the formula content unchanged.
</details>



_Source: Linters/SAST tools_

---

`237-239`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_

**Add language specifier to fenced code block.**

```diff
- ```
+ ```text
  TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) )
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @ERRATA.md around lines 237 - 239, Add a language specifier to the fenced
code block in the ERRATA documentation snippet so it is marked as text (for
example, the block containing TIN = 100 · exp( −D_KL(normalized coverage ‖
uniform) ) should use a text fence). Update the markdown fence itself and keep
the equation content unchanged.
</details>



_Source: Linters/SAST tools_

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @tests/test_rna.rs:

Around line 653-654: The strand-count check in the RNA test is too loose and
no longer detects double-counting; update the assertion on m.r1_tx_strand_reads
to require the exact expected value using the existing metric contract in this
test, and keep the companion m.r2_tx_strand_reads assertion unchanged. Use the m
variable in the test case with 10 FR pairs so the assertion verifies the precise
strand count rather than a minimum threshold.

Nitpick comments:
In @ERRATA.md:

Around line 228-230: Add a language specifier to the fenced code block in the
TIN formula snippet so the markdown renders as a text block; update the existing
fence around the equation to use a text code fence while keeping the formula
content unchanged.

Around line 237-239: Add a language specifier to the fenced code block in the
ERRATA documentation snippet so it is marked as text (for example, the block
containing TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) ) should use a
text fence). Update the markdown fence itself and keep the equation content
unchanged.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ]  Push a commit to this branch (recommended)
- [ ]  Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `7d97d606-38d7-4417-8a19-db82feb1d955`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between d425634d41ea6a16065f1f06723db39786ed6382 and 37e9993e504b51f396d1dc5573c782eb14082a7c.

</details>

<details>
<summary>📒 Files selected for processing (8)</summary>

* `CHANGELOG.md`
* `ERRATA.md`
* `README.md`
* `src/commands/docs.rs`
* `src/commands/rna.rs`
* `src/gene_model/mod.rs`
* `src/gene_model/refflat.rs`
* `tests/test_rna.rs`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (1)</summary>

* CHANGELOG.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* src/commands/docs.rs
* src/gene_model/mod.rs
* src/commands/rna.rs

</details>

</details>

CodeRabbit (PR #40): the auto_detects_forward_strandedness test builds exactly 10 FR pairs, so r1_tx_strand_reads is deterministically 10. Use assert_eq!(.,10) instead of >= 10 so the test catches template double-counting.

coderabbitai

🧹 Nitpick comments (1)

docs/rna-integrity.md (1)
26-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The formula block at Line 26 lacks a language tag, triggering markdownlint-cli2 MD040. Add text or math to silence the lint and improve renderer consistency.
- ```
+ ```text
  TIN = 100 · e^H / n,   H = −Σ pᵢ·ln pᵢ,   pᵢ = cᵢ / Σⱼ cⱼ
      = 100 · exp( −D_KL(coverage ‖ uniform) )
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/rna-integrity.md around lines 26 - 29, The fenced formula block in the
RNA integrity doc is missing a language specifier, which triggers markdownlint
MD040. Update the fenced block around the TIN formula in the markdown so it uses
a language tag such as text or math, keeping the existing content unchanged to
satisfy the linter and improve rendering consistency.
</details>



_Source: Linters/SAST tools_

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @docs/rna-integrity.md:

Around line 26-29: The fenced formula block in the RNA integrity doc is
missing a language specifier, which triggers markdownlint MD040. Update the
fenced block around the TIN formula in the markdown so it uses a language tag
such as text or math, keeping the existing content unchanged to satisfy the
linter and improve rendering consistency.
</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `08c3800f-4c63-4fe7-a1b1-ef843981e7dd`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between a22cec3e6a11eb627511a5a63108b693efa84912 and a3ff63523bb514924f398be200c7d27a9f61d8a1.

</details>

<details>
<summary>⛔ Files ignored due to path filters (2)</summary>

* `docs/images/rna-integrity-rin-vs-tin.png` is excluded by `!**/*.png`
* `docs/images/rna-integrity-tin-vs-depth.png` is excluded by `!**/*.png`

</details>

<details>
<summary>📒 Files selected for processing (7)</summary>

* `ERRATA.md`
* `README.md`
* `docs/rna-integrity.md`
* `src/commands/docs.rs`
* `src/commands/rna.rs`
* `src/plotting.rs`
* `tests/test_rna.rs`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (2)</summary>

* README.md
* ERRATA.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* src/commands/docs.rs
* src/plotting.rs
* tests/test_rna.rs

</details>

</details>

Ports CollectRnaSeqMetrics + transcript insert size from Picard/fgbio into a new `riker rna` subcommand, with supporting gene-model parsing and plotting. - Gene models: GTF, GFF (RefSeq NC_* accession mapping), and refFlat parsers under src/gene_model/, streamed via fgoxide rather than slurped to a String. - Hot path: per-locus exon-union coverage with a coordinate-sweep locus lookup and enclosure-based insert size; linear-time per-base classification. - Metrics: read origin, gene detection, splice junctions, TIN, biotype output, expanded QC plots, and a TLEN insert-size fallback. TIN validated on GENCODE v50. - Integrates with the multi command (CollectorKind::Rna) and the docs registry. - Startup hardening: validate output writability at startup (fail-fast). - Docs: rna tool docs plus intentional differences vs Picard/fgbio split into ERRATA.md and docs/rna-integrity.md. Squashed from 20 commits on tf_rna (preserved in tf_rna_pre_rebase); the underlying feature is a single new command.

Correctness / robustness: - GTF/GFF3: fold the `stop_codon` feature into the coding extent so cds_end includes the stop codon, matching Picard and the refFlat path. GENCODE emits stop_codon as a separate feature that CDS excludes; RefSeq/refFlat already include it. - refFlat parser: enforce txStart<txEnd and exon/CDS within the transcript span, so the unchecked u32 subtractions in count_blocks/build_loci can't silently wrap on a malformed file — reject it instead. - rna: saturating_mul on the `2 * end_bias_bases` short-transcript gate so a pathological `--end-bias-bases` can't wrap it and panic the coverage slices. - gene-model format auto-detect: choose GTF vs GFF3 by which key/value separator appears first, so a GTF value containing '=' isn't misread as GFF3. - gene-model parse errors now carry the file path and detected format. Cleanup / placement: - Move overlap_inclusive/encloses_inclusive out of gene_model (rna is the only caller) into rna.rs and drop them from the public API. - Move the MateSpan helper struct out of the module-function section. - intervals::read_file_contents back to private, and fix its stale doc: it is not shared with the gene-model reader (which streams via fgoxide). - Wrap the rna `--help` long_about to house style; mark the unused FG_PINE brand colour `#[allow(dead_code)]`; fix the module doc that overstated what `--exclude-duplicates` drops. - CHANGELOG: correct the rna output-file list (drop the non-existent .rna-coverage.txt; add the biotype/gene-expression charts). Tests: rna-through-multi integration test; stop_codon extension (GTF + GFF3); refFlat span validation; contig reconciliation; format auto-detect '='. Deferred to follow-ups (#20, #21): reduce rna per-read CIGAR allocation; evaluate the noodles gtf/gff parsers. Skipped as theoretical: the split-MC insert-size dedup edge case.

GFF3 percent-encodes reserved characters (; = & , % tab newline) inside attribute values; our hand-rolled parser left them encoded, so a gene name or Name containing an encoded comma/semicolon came through mangled. Decode values at extraction (gene_name / Name / gene_id / transcript_id / ID, biotype, Parent, and the RefSeq region accession->name mapping). Parent is split on the raw ',' separator first, then decoded, so an encoded comma (%2C) inside one id is not treated as a separator. percent-encoding was already in the tree transitively; promoted to a direct dep. This closes out the noodles-gtf/gff parser evaluation. A full switch to noodles was ~3.5x slower on the gene-model parse (8.4s vs 2.4s on GENCODE v29 GTF, since noodles builds a full percent-decoded attribute map per line) and left all the source-specific grouping / biotype-key / RefSeq-accession logic ours -- so percent-decoding, the one real gap noodles would have closed, is added directly to the hand-rolled parser instead.

bases_overlapping_exons was the single largest self-time item in the --threads 3 hot-path profile (~6% self). It summed the overlap of every alignment block against every exon of a transcript — an O(blocks×exons) product — even though a read's blocks and a transcript's exons are both sorted by start and internally disjoint, so each block can only overlap a short contiguous run of exons. Replace the nested product with the same partition_point + break merge-walk already used by overlap_with_intervals: for each block, binary-search the first exon that can overlap and scan only until an exon starts past the block. O(blocks + exons) with early termination. Output is byte-identical on the 80M-read ENCODE BAM (all four rna TSVs). Instructions retired drop ~1.5% (~600.4B -> ~591.2B, consistent across runs); wall time on this laptop is thermal-noise-dominated and can't resolve a change this size.

classify_bases and tally_insert_size each called rec_alignment_blocks on the same record independently, so every paired read walked and allocated its CIGAR alignment blocks twice. Parse them once at the top of accept() into a local Vec and pass &[(u32,u32)] to both; classify_bases no longer needs the record at all. The block Vec is deliberately a local passed by reference, not a field on the collector: a self-owned scratch buffer read in a loop that also mutates other self fields forces aliasing-conservative reloads (measured ~3.5% regression previously), whereas a provably non-aliasing local does not. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~2.4% (~590.2B -> ~575.8B, consistent across runs).

The MC (mate CIGAR) tag was parsed up to three times per read: once in classify_ribosomal and once in tally_strand (both via mate_alignment_end, each of which copied the tag bytes and re-walked the CIGAR only to take its reference end) and a third time in resolve_mate_span for insert size. For a paired RNA read that path meant ~3 tag copies + 3 CIGAR parses + their allocations, every read. Parse it once at the top of accept() into a MateAlign { Absent, Malformed, Parsed { blocks, end } }, computed only for same-contig mapped pairs (the only reads any consumer uses it for), and thread &MateAlign through the three callers: - mate_alignment_end -> mate_end_or_tlen(record, mate_start, &MateAlign): reads the precomputed end, same TLEN fallback for Absent/Malformed. - resolve_mate_span borrows the parsed blocks via Cow (owned only for the rare MC-absent TLEN estimate), so MateSpan no longer allocates on the exact path. Absent conflates 'not a same-contig pair' with 'no MC tag', which is sound: every consumer that could see the former already returns before using the value (different-contig strand reads have proper=false, so the mate end is unused). Malformed maps to the same skip the old produced. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~5.3% (~573.9B -> ~543.5B) and wall time ~29.2s -> ~27.3s this run; cumulative with the two prior commits ~9.5% fewer instructions vs the start of this pass.

rec_alignment_blocks, mate_alignment_blocks, and rec_introns each returned a freshly heap-allocated Vec per read; on the 80M-read BAM that is on the order of 150M+ small allocations, and rpmalloc alloc/dealloc was ~8% of the compute thread's self time in the profile. Return a SmallVec<[(u32,u32); 4]> (aliased CigarBlocks) instead: a read's M/=/X blocks (unspliced 1, one junction 2, ...) and its introns fit inline in the common case, so the parse never touches the heap; only reads spanning more than four blocks spill. Every consumer already took a slice (&[(u32,u32)]) or borrows via MateSpan's Cow, so nothing downstream changed. The MateAlign::Parsed blocks field carries the same SmallVec. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~3.75% (~544.0B -> ~523.6B) and wall time ~27.6s -> ~26.6s; cumulative over this pass ~12.8% fewer instructions vs the start.

accept() allocated a fresh Vec<u32> for the gene loci overlapping each read. Most reads fall in zero or one locus, so this was ~one small heap allocation per read for a 1-2 element vector. Return the sweep results into a SmallVec<[u32; 4]> (aliased LociVec) instead: zero/one/few-locus reads stay inline and never touch the heap; only reads over more than four overlapping genes spill. overlapping_loci still clear()s and extend()s it, and every consumer already took &[u32], so nothing else changed. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~1.4% (~523.6B -> ~516.4B); cumulative over this pass ~14% fewer instructions vs the start.

parse_mate_align went through mate_cigar, which did record.aux_tag(MC)? .as_str().map(|s| s.to_vec()): aux_tag already hands back an owned AuxValue (the aux store is cloned out), and .to_vec() then allocated and memcpy'd a *second* copy of the same bytes for every paired read carrying an MC tag. Inline mate_cigar into parse_mate_align and move the Vec out of the owned AuxValue::String via a let-else, so there's exactly one allocation (the one aux_tag already made) and no extra memcpy. A missing or non-string MC tag maps to Absent exactly as the old Option path did. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~1.7% (~516.6B -> ~507.9B); cumulative over this pass ~15.4% fewer instructions vs the start.

On the multi-locus classification path, classify_blocks unioned the overlapping genes' exon and coding intervals by concatenating them and calling merge_intervals, which sort_unstable()s the whole set. But each locus's exon_union / coding_union is already sorted and disjoint by construction, so the sort is redundant work — and quicksort was ~3% of the compute thread's self time in the profile, almost all of it from this per-read path. Add union_of_sorted_disjoint (a pairwise merge of pre-sorted sets) and use it for the coding and exon unions; the per-locus spans stay on merge_intervals since there are only a few and they aren't pre-sorted as a set. Factor the coalescing step both share into push_coalesced. The union is the same set regardless of merge order, so output is unchanged; added unit tests pin the k-way result to merge_intervals' and cover overlap, adjacency, dedup, and empty/single-set inputs. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~3.5% (~507.8B -> ~490.1B) and wall time ~26.2s -> ~24.8s; cumulative over this pass ~18% fewer instructions vs the start.

Code review found that bases_overlapping_exons's merge-walk (binary-search + early-exit) silently assumed a transcript's exons are pairwise disjoint — a precondition the parsers never enforce (they only sort). A malformed or duplicate exon record for one transcript (from hand-edited or merged annotation files) breaks the partition_point's monotonic-end precondition, which could undercount to zero or, in debug builds, panic on u32 underflow. Make the invariant real: normalize_exons() sorts each transcript's exons and coalesces any that overlap, so the exon list handed to every downstream interval walk is sorted and pairwise disjoint. On real annotations (whose per-transcript exons never overlap) this only sorts — a no-op that keeps output byte-identical on the 80M-read ENCODE BAM. It does not change the locus exon_union at all (merge_intervals already coalesced the same intervals), so gene-level coverage is untouched; the only effect is that a malformed overlapping-exon transcript now counts each exonic base once (sane deduplicated per-transcript coverage) instead of double-counting. Keeps the merge-walk's full speed (the ~2.3% it's worth); normalize is a one-time load cost (~0.3%).

…d-MC coverage Follow-ups from the perf-pass review, no behavior change: - Delete a stale doc line left over from the removed mate_cigar fn, which had glued itself onto parse_mate_align's doc block. - Move the smallvec import into the third-party group (was appended after the crate:: group). - Rename mate_alignment_end_includes_trailing_skip -> _blocks_ (it exercises mate_alignment_blocks; the old function name no longer exists). - Point the bases_overlapping_exons doc at the newly-enforced disjoint-exon invariant. - Add insert_size_skips_pair_with_malformed_mc: pins that a present-but- unparseable MC tag (MateAlign::Malformed) skips the pair rather than falling through to the TLEN estimate.

coderabbitai Bot requested changes Jun 27, 2026

View reviewed changes

Comment thread src/commands/common.rs

Comment thread src/commands/rna.rs

Comment thread src/commands/rna.rs Outdated

Comment thread src/gene_model/gff.rs

Comment thread src/gene_model/mod.rs Outdated

Comment thread src/gene_model/mod.rs

Comment thread src/plotting.rs

coderabbitai Bot approved these changes Jun 27, 2026

View reviewed changes

coderabbitai Bot requested changes Jun 27, 2026

View reviewed changes

Comment thread tests/test_rna.rs Outdated

coderabbitai Bot approved these changes Jun 27, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

tfenne force-pushed the tf_rna branch from a3ff635 to 38056e1 Compare July 2, 2026 20:11

tfenne added 12 commits July 2, 2026 21:49

tfenne force-pushed the tf_rna branch from ee675ae to 62a3203 Compare July 3, 2026 03:53

tfenne merged commit 8fae183 into main Jul 3, 2026
5 checks passed

tfenne deleted the tf_rna branch July 3, 2026 11:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add `rna` command: RNA-seq QC metrics (CollectRnaSeqMetrics + transcript insert size)#40

Add `rna` command: RNA-seq QC metrics (CollectRnaSeqMetrics + transcript insert size)#40
tfenne merged 12 commits into
mainfrom
tf_rna

tfenne commented Jun 27, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

Reviews paused

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tfenne commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Metrics (one rna-metrics.txt row)

Highlights

TIN scale & validation

Intentional differences vs. Picard/fgbio (documented in ERRATA.md)

Performance & memory

Testing & review

Out of scope (tracked separately)

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tfenne commented Jun 27, 2026 •

edited

Loading

Metrics (one `rna-metrics.txt` row)

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading